Combining and Reshaping Data

Data in pandas can be combined in two broad ways. Concatenation combines two sets of data along either axis, simply stacking them together without regard to relationships in the data. Alternatively, data can be combined using relationships within the data via a pandas capability referred to as merging, which provides join operations similar to those in relational databases.

Reshaping - There are three primary means of reshaping data:

  • Pivoting - restructures pandas data similarly to how spreadsheets pivot data, by creating new index levels and moving data into columns based upon values.
  • Stacking and Unstacking - similar to pivoting, but let us pivot data organized with multiple levels of indexes.
  • Melting - lets us restructure data into the unique ID-variable-measurement combinations required for many statistical analyses.

Setting up the Jupyter Notebook


In [1]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime

# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html',False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)

Concatenating Data


In [2]:
# two series objects to concatenate
s1 = pd.Series(np.arange(0,3))
s2 = pd.Series(np.arange(5,8))
s1


Out[2]:
0    0
1    1
2    2
dtype: int64

In [3]:
s2


Out[3]:
0    5
1    6
2    7
dtype: int64

concat method


In [4]:
# concatenate them
pd.concat([s1,s2])


Out[4]:
0    0
1    1
2    2
0    5
1    6
2    7
dtype: int64

In [5]:
# dataframe objects can also be concatenated
# create two dataframe objects to concatenate
# using the same index labels and column names
# but different values
df1 = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'])
df2 = pd.DataFrame(np.arange(9,18).reshape(3,3),columns=['a','b','c'])
pd.concat([df1,df2])


Out[5]:
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
0   9  10  11
1  12  13  14
2  15  16  17

In [6]:
# demonstrate concatenating two dataframe objects
# with different columns
df1 = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'])
df2 = pd.DataFrame(np.arange(9,18).reshape(3,3),columns=['a','c','d'])
pd.concat([df1,df2])


Out[6]:
    a    b   c     d
0   0  1.0   2   NaN
1   3  4.0   5   NaN
2   6  7.0   8   NaN
0   9  NaN  10  11.0
1  12  NaN  13  14.0
2  15  NaN  16  17.0

In [9]:
# concat the two objects, but create an index using the given keys
c = pd.concat([df1,df2],keys=['df1','df2'])
c


Out[9]:
        a    b   c     d
df1 0   0  1.0   2   NaN
    1   3  4.0   5   NaN
    2   6  7.0   8   NaN
df2 0   9  NaN  10  11.0
    1  12  NaN  13  14.0
    2  15  NaN  16  17.0

In [10]:
# we can extract data originating from the first or second source dataframe
c.loc['df2']


Out[10]:
    a   b   c     d
0   9 NaN  10  11.0
1  12 NaN  13  14.0
2  15 NaN  16  17.0

The pd.concat() function also allows you to specify the axis on which to apply the concatenation.


In [11]:
# concat df1 and df2 along columns
# aligns on row labels, has duplicate columns
pd.concat([df1,df2],axis=1)


Out[11]:
   a  b  c   a   c   d
0  0  1  2   9  10  11
1  3  4  5  12  13  14
2  6  7  8  15  16  17

In [12]:
# a new dataframe to concatenate with df1
# it shares one common row label (2) with df1,
# one common column (a), and has disjoint columns
# (b and c only in df1, d only in df3)
df3 = pd.DataFrame(np.arange(20,26).reshape(3,2),columns=['a','d'],index=[2,3,4])
df3


Out[12]:
    a   d
2  20  21
3  22  23
4  24  25

In [13]:
# concat them. Alignment is along row labels
# columns first from df1 and then df3, with duplicates.
# NaN filled in where those columns do not exist in the source
pd.concat([df1,df3],axis=1)


Out[13]:
     a    b    c     a     d
0  0.0  1.0  2.0   NaN   NaN
1  3.0  4.0  5.0   NaN   NaN
2  6.0  7.0  8.0  20.0  21.0
3  NaN  NaN  NaN  22.0  23.0
4  NaN  NaN  NaN  24.0  25.0

A concatenation of two or more DataFrame objects actually performs an outer join operation along the index labels on the axis opposite to the one specified. This makes the result of the concatenation similar to having performed a union of those index labels, and then data is filled based on the alignment of those labels to the source objects.

The type of join can be changed to an inner join by specifying join='inner' as a parameter. The inner join then logically performs an intersection of the index labels instead of a union.


In [14]:
# do an inner join instead of outer
# results in one row
pd.concat([df1,df3],axis=1,join='inner')


Out[14]:
   a  b  c   a   d
2  6  7  8  20  21

In [15]:
# add keys to the columns
df = pd.concat([df1,df2],axis=1,keys=['df1','df2'])
df


Out[15]:
  df1       df2        
    a  b  c   a   c   d
0   0  1  2   9  10  11
1   3  4  5  12  13  14
2   6  7  8  15  16  17

In [16]:
# retrieve the data that originated from the
# DataFrame with key = df2
df.loc[:,'df2']


Out[16]:
    a   c   d
0   9  10  11
1  12  13  14
2  15  16  17

append method


In [17]:
# append performs a concatenation along axis=0
# duplicate row index labels can result
df1.append(df2)


Out[17]:
    a    b   c     d
0   0  1.0   2   NaN
1   3  4.0   5   NaN
2   6  7.0   8   NaN
0   9  NaN  10  11.0
1  12  NaN  13  14.0
2  15  NaN  16  17.0

In [18]:
# remove duplicates in the result index by ignoring the
# index labels in the source DataFrame objects
df1.append(df2,ignore_index=True)


Out[18]:
    a    b   c     d
0   0  1.0   2   NaN
1   3  4.0   5   NaN
2   6  7.0   8   NaN
3   9  NaN  10  11.0
4  12  NaN  13  14.0
5  15  NaN  16  17.0
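
Note that the .append() method was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same result must be produced with pd.concat(). A minimal equivalent of the call above:

# pd.concat() equivalent of df1.append(df2, ignore_index=True)
pd.concat([df1, df2], ignore_index=True)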

Merging and Joining Data

pandas allows the merging of pandas objects with database-like join operations using the pd.merge() function and the .merge() method of a DataFrame object. These joins are high performance and are performed in memory. A merge combines the data of two pandas objects by finding matching values in one or more columns or row indexes.


In [24]:
# these are customers data
customers = {'CustomerID':[10,11],'Name':['Mike','Marcia'],'Address':['Address for Mike','Address for Marcia']}
customers = pd.DataFrame(customers)
customers


Out[24]:
              Address  CustomerID    Name
0    Address for Mike          10    Mike
1  Address for Marcia          11  Marcia

In [25]:
# and these are the orders made by our customers
# they are related to customers by CustomerID
orders =  {'CustomerID':[10,11,10],'OrderDate':[datetime.date(2014,12,1),datetime.date(2014,12,1),datetime.date(2014,12,1)]}
orders = pd.DataFrame(orders)
orders


Out[25]:
   CustomerID   OrderDate
0          10  2014-12-01
1          11  2014-12-01
2          10  2014-12-01

In [26]:
# merge customers and orders so we can ship the items
customers.merge(orders)


Out[26]:
              Address  CustomerID    Name   OrderDate
0    Address for Mike          10    Mike  2014-12-01
1    Address for Mike          10    Mike  2014-12-01
2  Address for Marcia          11  Marcia  2014-12-01

pandas has done something almost magical for us here by accomplishing this with such a simple piece of code. It recognized that our customers and orders objects both have a column named CustomerID. With this knowledge, it used the common values found in that column of both DataFrame objects to relate the data and form the merged result based on inner join semantics. More specifically, pandas performs the following steps (a sketch of the equivalent explicit call follows the list):

  • It determines the columns in both customers and orders with common labels. These columns are treated as the keys to perform the join.
  • It creates a new DataFrame whose columns are the labels from the keys identified in step 1, followed by all of the non-key labels from both objects.
  • It matches values in the key columns of both DataFrame objects.
  • It then creates a row in the result for each set of matching labels.
  • It then copies the data from those matching rows from each source object into that respective row and columns of the result.
  • It assigns a new Int64Index to the result.
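
As a sketch, the implicit merge of customers and orders shown above is equivalent to naming the key column and join type explicitly:

# explicit version of customers.merge(orders):
# join on the common CustomerID column with inner join semantics
customers.merge(orders, on='CustomerID', how='inner')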

In [27]:
# sample data
left_data = {'Key1':['a','b','c'],'Key2':['x','y','z'],'lval1':[0,1,2]}
right_data = {'Key1':['a','b','c'],'Key2':['x','a','z'],'rval1':[6,7,8]}

left = pd.DataFrame(left_data,index=[0,1,2])
right = pd.DataFrame(right_data,index=[1,2,3])

left


Out[27]:
  Key1 Key2  lval1
0    a    x      0
1    b    y      1
2    c    z      2

In [28]:
right


Out[28]:
  Key1 Key2  rval1
1    a    x      6
2    b    a      7
3    c    z      8

In [29]:
# demonstrate merge without specifying columns to merge
# this will implicitly merge on all common columns
left.merge(right)


Out[29]:
  Key1 Key2  lval1  rval1
0    a    x      0      6
1    c    z      2      8

In [31]:
# demonstrate merge using an explicit column
# the column passed to on must exist in both DataFrame objects
left.merge(right,on='Key1')


Out[31]:
  Key1 Key2_x  lval1 Key2_y  rval1
0    a      x      0      x      6
1    b      y      1      a      7
2    c      z      2      z      8

In [32]:
# merge explicitly using two columns
left.merge(right,on=['Key1','Key2'])


Out[32]:
  Key1 Key2  lval1  rval1
0    a    x      0      6
1    c    z      2      8

In [33]:
# join on the row indices of both DataFrame objects
pd.merge(left,right,left_index=True,right_index=True)


Out[33]:
  Key1_x Key2_x  lval1 Key1_y Key2_y  rval1
1      b      y      1      a      x      6
2      c      z      2      b      a      7

This has identified that the index labels in common are 1 and 2, so the resulting DataFrame has two rows with these labels in the index. pandas then creates a column in the result for every column in both objects and copies the values.

As both DataFrame objects had columns with identical names (Key1 and Key2), the columns in the result have the _x and _y suffixes appended to identify the DataFrame from which they originated.

Specifying the join semantics of a merge operation

The default type of join performed by pd.merge() is an inner join. To use another join method, the method of join to be used can be specified using the how parameter of the pd.merge() function. The valid options are:

  • inner - This is the intersection of keys from both DataFrame objects
  • outer - This is the union of keys from both DataFrame objects
  • left - This only uses keys from the left DataFrame
  • right - This only uses keys from the right DataFrame

In [34]:
# outer join, merges all matched data
# and fills unmatched items with NaN
left.merge(right,how='outer')


Out[34]:
  Key1 Key2  lval1  rval1
0    a    x    0.0    6.0
1    b    y    1.0    NaN
2    c    z    2.0    8.0
3    b    a    NaN    7.0

In [35]:
# left join: keeps all keys from the left dataframe,
# with NaN filled in for items unmatched in right
# rows with labels 0 and 2 match on Key1 and Key2;
# the row with label 1 exists only in left
left.merge(right,how="left")


Out[35]:
  Key1 Key2  lval1  rval1
0    a    x      0    6.0
1    b    y      1    NaN
2    c    z      2    8.0

In [36]:
# right join: keeps all keys from the right dataframe,
# with NaN filled in for items unmatched in left
# rows with labels 0 and 1 match on Key1 and Key2;
# the row with label 2 exists only in right
left.merge(right, how='right')


Out[36]:
  Key1 Key2  lval1  rval1
0    a    x    0.0      6
1    c    z    2.0      8
2    b    a    NaN      7

In [37]:
# join left with right (.join() defaults to a left join on the index)
# since these DataFrame objects have duplicate column names
# we must specify lsuffix and rsuffix
left.join(right,lsuffix='_left',rsuffix='_right')


Out[37]:
  Key1_left Key2_left  lval1 Key1_right Key2_right  rval1
0         a         x      0        NaN        NaN    NaN
1         b         y      1          a          x    6.0
2         c         z      2          b          a    7.0

In [38]:
# join left with right with an inner join
left.join(right, lsuffix='_left',rsuffix='_right',how='inner')


Out[38]:
  Key1_left Key2_left  lval1 Key1_right Key2_right  rval1
1         b         y      1          a          x      6
2         c         z      2          b          a      7

Pivoting

Data is often stored in a stacked format, also referred to as record format; this is common in databases, .csv files, and Excel spreadsheets. In a stacked format, the data is often not normalized, with repeated values in many columns or values that should logically exist in other tables.


In [39]:
# read in accelerometer data
sensor_readings = pd.read_csv('../../data/accel.csv')
sensor_readings


Out[39]:
    interval axis  reading
0          0    X      0.0
1          0    Y      0.5
2          0    Z      1.0
3          1    X      0.1
4          1    Y      0.4
..       ...  ...      ...
7          2    Y      0.3
8          2    Z      0.8
9          3    X      0.3
10         3    Y      0.2
11         3    Z      0.7

[12 rows x 3 columns]

In [40]:
# extract X-axis readings
sensor_readings[sensor_readings['axis'] == 'X']


Out[40]:
   interval axis  reading
0         0    X      0.0
3         1    X      0.1
6         2    X      0.2
9         3    X      0.3

In [42]:
# pivot the data: interval becomes the index, the distinct
# axis values become the columns, and the readings supply the values
sensor_readings.pivot(index='interval',columns='axis',values='reading')


Out[42]:
axis        X    Y    Z
interval               
0         0.0  0.5  1.0
1         0.1  0.4  0.9
2         0.2  0.3  0.8
3         0.3  0.2  0.7

This has taken all of the distinct values from the axis column, and pivoted them into columns on the new DataFrame, while filling in values for the new columns from the appropriate rows and columns of the original DataFrame.
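
One caveat: .pivot() raises an exception if any index/column combination occurs more than once in the source. When duplicates are possible, pd.pivot_table() can aggregate them instead; a minimal sketch, assuming that averaging duplicate readings would be acceptable:

# pivot_table tolerates duplicate interval/axis pairs
# by aggregating them (here, by taking the mean)
sensor_readings.pivot_table(index='interval', columns='axis',
                            values='reading', aggfunc='mean')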

Stacking and Unstacking

Stacking pivots a level of the column labels into the row index. Unstacking pivots a level of the row index into the column index.

One of the differences between stacking/unstacking and performing a pivot is that, unlike a pivot, the stack and unstack functions can pivot specific levels of a hierarchical index. Also, where a pivot retains the same number of levels on an index, a stack or unstack always increases the number of levels on the index of one axis and decreases the levels on the other.


In [46]:
# simple DataFrame with one column
df = pd.DataFrame({'a':[1,2]}, index=['one','two'])
df


Out[46]:
     a
one  1
two  2

Stacking moves one level of the column index into a new level of the row index. Since our DataFrame has only a single column level, stacking collapses the DataFrame object into a Series object with a hierarchical row index:


In [45]:
# push the column to another level of the index
# the result is a Series where values are looked up through
# a multi-index
stacked1 = df.stack()
stacked1


Out[45]:
one  a    1
two  a    2
dtype: int64

In [47]:
# look up the value for one / a using a tuple on the index
stacked1[('one','a')]


Out[47]:
1

In [48]:
# DataFrame with two columns
df = pd.DataFrame({'a':[1,2],'b':[3,4]},index=['one','two'])
df


Out[48]:
     a  b
one  1  3
two  2  4

In [49]:
# push the two columns into a new level of the row index
stacked2 = df.stack()
stacked2


Out[49]:
one  a    1
     b    3
two  a    2
     b    4
dtype: int64

In [50]:
# lookup value with index two / b
stacked2[('two','b')]


Out[50]:
4

Unstacking


In [51]:
# make two copies of the sensor data, one for each user
user1 = sensor_readings.copy()
user2 = sensor_readings.copy()
# add names to the two copies
user1['who'] = 'Mike'
user2['who'] = 'Mikael'
# for demonstration, let's scale user2's readings
user2['reading'] *= 100
# and reorganize this to have a hierarchical row index
multi_user_sensor_data = pd.concat([user1,user2]).set_index(['who','interval','axis'])
multi_user_sensor_data


Out[51]:
                      reading
who    interval axis         
Mike   0        X         0.0
                Y         0.5
                Z         1.0
       1        X         0.1
                Y         0.4
...                       ...
Mikael 2        Y        30.0
                Z        80.0
       3        X        30.0
                Y        20.0
                Z        70.0

[24 rows x 1 columns]

In [52]:
# look up user data for Mike using just the index
multi_user_sensor_data.loc['Mike']


Out[52]:
               reading
interval axis         
0        X         0.0
         Y         0.5
         Z         1.0
1        X         0.1
         Y         0.4
...                ...
2        Y         0.3
         Z         0.8
3        X         0.3
         Y         0.2
         Z         0.7

[12 rows x 1 columns]

In [53]:
# readings for all users and axes at interval 1
multi_user_sensor_data.xs(1,level='interval')


Out[53]:
             reading
who    axis         
Mike   X         0.1
       Y         0.4
       Z         0.9
Mikael X        10.0
       Y        40.0
       Z        90.0

Unstacking will move the last level of the row index into a new level of the column index, resulting in the columns having a MultiIndex.


In [54]:
# unstack axis
multi_user_sensor_data.unstack()


Out[54]:
                reading             
axis                  X     Y      Z
who    interval                     
Mikael 0            0.0  50.0  100.0
       1           10.0  40.0   90.0
       2           20.0  30.0   80.0
       3           30.0  20.0   70.0
Mike   0            0.0   0.5    1.0
       1            0.1   0.4    0.9
       2            0.2   0.3    0.8
       3            0.3   0.2    0.7

In [55]:
# unstack at level=0
multi_user_sensor_data.unstack(level=0)


Out[55]:
              reading     
who            Mikael Mike
interval axis             
0        X        0.0  0.0
         Y       50.0  0.5
         Z      100.0  1.0
1        X       10.0  0.1
         Y       40.0  0.4
...               ...  ...
2        Y       30.0  0.3
         Z       80.0  0.8
3        X       30.0  0.3
         Y       20.0  0.2
         Z       70.0  0.7

[12 rows x 2 columns]

Multiple levels can be unstacked simultaneously by passing a list of levels to .unstack(). Additionally, if the levels are named, they can be specified by name instead of location.


In [57]:
# unstack who and axis levels
unstacked = multi_user_sensor_data.unstack(['who','axis'])
unstacked


Out[57]:
         reading                              
who         Mike           Mikael             
axis           X    Y    Z      X     Y      Z
interval                                      
0            0.0  0.5  1.0    0.0  50.0  100.0
1            0.1  0.4  0.9   10.0  40.0   90.0
2            0.2  0.3  0.8   20.0  30.0   80.0
3            0.3  0.2  0.7   30.0  20.0   70.0

In [58]:
# and we can of course stack what we have unstacked
# this re-stacks who
unstacked.stack(level='who')


Out[58]:
                reading             
axis                  X     Y      Z
interval who                        
0        Mikael     0.0  50.0  100.0
         Mike       0.0   0.5    1.0
1        Mikael    10.0  40.0   90.0
         Mike       0.1   0.4    0.9
2        Mikael    20.0  30.0   80.0
         Mike       0.2   0.3    0.8
3        Mikael    30.0  20.0   70.0
         Mike       0.3   0.2    0.7

There are a couple of points to note here:

  • Stacking and unstacking always move the levels into the last positions of the other index. Note that the who level is now the last level of the row index, even though it started out as the first level. This has ramifications for code that accesses elements via that index, since the level has moved. If you want to put a level back into another position, you need to reorganize the indexes by means other than stacking and unstacking (see the sketch after this list).
  • For all this moving around of data, stacking and unstacking do not lose any information; they simply change the means by which it is organized and accessed.
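
As a minimal sketch of such a reorganization, the who level can be moved back to the front of the row index using .swaplevel() followed by .sort_index():

# re-stack who, then move it back to the first index level
restacked = unstacked.stack(level='who')
restacked.swaplevel('who', 'interval').sort_index()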

Melting

Melting is a type of unpivoting, and is often referred to as changing a DataFrame object from wide format to long format.

Technically, it is the process of reshaping a DataFrame into a format where two new columns, referred to as variable and value, are created by unpivoting the column labels into the variable column and moving the data from those columns into the appropriate locations in the value column.


In [59]:
# we will demonstrate melting in this DataFrame
data = pd.DataFrame({'Name':['Mike','Mikael'],'Height':[6.1,5.9], 'Weight':[220,185]})
data


Out[59]:
   Height    Name  Weight
0     6.1    Mike     220
1     5.9  Mikael     185

In [60]:
# melt it, use Name as the id,
# Height and Weight columns as the variables
pd.melt(data,id_vars=['Name'],value_vars=['Height','Weight'])


Out[60]:
     Name variable  value
0    Mike   Height    6.1
1  Mikael   Height    5.9
2    Mike   Weight  220.0
3  Mikael   Weight  185.0

The data is now structured so that it is easy to extract the value for any combination of variable and Name. Additionally, in this format it is easier to add a new variable and measurement: the data can simply be added as a new row, instead of requiring a change to the structure of the DataFrame by adding a new column.
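
As a sketch, recording a new measurement requires only concatenating a single row (the Age variable and its value here are purely illustrative):

# melt as before, then add a hypothetical new variable as a single row
melted = pd.melt(data, id_vars=['Name'], value_vars=['Height','Weight'])
new_row = pd.DataFrame({'Name': ['Mike'], 'variable': ['Age'], 'value': [32.0]})
pd.concat([melted, new_row], ignore_index=True)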

Performance Considerations


In [62]:
# compare the speed of three ways to look up a single scalar
import timeit
# tuple lookup on the stacked Series via its multi-index
r1 = timeit.timeit(lambda: stacked1.loc[('one','a')],number=10000)
# chained label-based lookup on the DataFrame
r2 = timeit.timeit(lambda: df.loc['one']['a'],number=10000)
# positional lookup on the DataFrame
r3 = timeit.timeit(lambda: df.iloc[1,0],number=10000)

# and the results are
r1,r2,r3


Out[62]:
(19.58831945201382, 1.0793023779988289, 0.09692429000278935)
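
In these results, positional access via .iloc is by far the fastest, chained label-based access on the DataFrame is roughly an order of magnitude slower, and the tuple lookup on the stacked Series is the slowest of the three: the convenience of the stacked representation comes at a cost for scalar access.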